<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<!-- Copyright (C) 2006-2023 Free Software Foundation, Inc.

Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License, Version 1.3 or
any later version published by the Free Software Foundation; with the
Invariant Sections being "Funding Free Software", the Front-Cover
texts being (a) (see below), and with the Back-Cover Texts being (b)
(see below).  A copy of the license is included in the section entitled
"GNU Free Documentation License".

(a) The FSF's Front-Cover Text is:

A GNU Manual

(b) The FSF's Back-Cover Text is:

You have freedom to copy and modify this GNU Manual, like GNU
     software.  Copies published by the Free Software Foundation raise
     funds for GNU development. -->
<!-- Created by GNU Texinfo 6.7, http://www.gnu.org/software/texinfo/ -->
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>nvptx (GNU libgomp)</title>

<meta name="description" content="nvptx (GNU libgomp)">
<meta name="keywords" content="nvptx (GNU libgomp)">
<meta name="resource-type" content="document">
<meta name="distribution" content="global">
<meta name="Generator" content="makeinfo">
<link href="index.html" rel="start" title="Top">
<link href="Library-Index.html" rel="index" title="Library Index">
<link href="index.html#SEC_Contents" rel="contents" title="Table of Contents">
<link href="Offload_002dTarget-Specifics.html" rel="up" title="Offload-Target Specifics">
<link href="The-libgomp-ABI.html" rel="next" title="The libgomp ABI">
<link href="AMD-Radeon.html" rel="prev" title="AMD Radeon">
<style type="text/css">
<!--
a.summary-letter {text-decoration: none}
blockquote.indentedblock {margin-right: 0em}
div.display {margin-left: 3.2em}
div.example {margin-left: 3.2em}
div.lisp {margin-left: 3.2em}
kbd {font-style: oblique}
pre.display {font-family: inherit}
pre.format {font-family: inherit}
pre.menu-comment {font-family: serif}
pre.menu-preformatted {font-family: serif}
span.nolinebreak {white-space: nowrap}
span.roman {font-family: initial; font-weight: normal}
span.sansserif {font-family: sans-serif; font-weight: normal}
ul.no-bullet {list-style: none}
-->
</style>


</head>

<body lang="en">
<span id="nvptx"></span><div class="header">
<p>
Previous: <a href="AMD-Radeon.html" accesskey="p" rel="prev">AMD Radeon</a>, Up: <a href="Offload_002dTarget-Specifics.html" accesskey="u" rel="up">Offload-Target Specifics</a> &nbsp; [<a href="index.html#SEC_Contents" title="Table of contents" rel="contents">Contents</a>][<a href="Library-Index.html" title="Index" rel="index">Index</a>]</p>
</div>
<hr>
<span id="nvptx-1"></span><h3 class="section">12.2 nvptx</h3>

<p>On the hardware side, there is the hierarchy (fine to coarse):
</p><ul>
<li> thread
</li><li> warp
</li><li> thread block
</li><li> streaming multiprocessor
</li></ul>

<p>All OpenMP and OpenACC levels are used, i.e.
</p><ul>
<li> OpenMP&rsquo;s simd and OpenACC&rsquo;s vector map to threads
</li><li> OpenMP&rsquo;s threads (&ldquo;parallel&rdquo;) and OpenACC&rsquo;s workers map to warps
</li><li> OpenMP&rsquo;s teams and OpenACC&rsquo;s gang use a threadpool with the
      size of the number of teams or gangs, respectively.
</li></ul>

<p>The used sizes are
</p><ul>
<li> The <code>warp_size</code> is always 32
</li><li> CUDA kernel launched: <code>dim={#teams,1,1}, blocks={#threads,warp_size,1}</code>.
</li><li> The number of teams is limited by the number of blocks the device can
      host simultaneously.
</li></ul>

<p>Additional information can be obtained by setting the environment variable to
<code>GOMP_DEBUG=1</code> (very verbose; grep for <code>kernel.*launch</code> for launch
parameters).
</p>
<p>GCC generates generic PTX ISA code, which is just-in-time compiled by CUDA,
which caches the JIT in the user&rsquo;s directory (see CUDA documentation; can be
tuned by the environment variables <code>CUDA_CACHE_{DISABLE,MAXSIZE,PATH}</code>.
</p>
<p>Note: While PTX ISA is generic, the <code>-mptx=</code> and <code>-march=</code> commandline
options still affect the used PTX ISA code and, thus, the requirements on
CUDA version and hardware.
</p>
<p>The implementation remark:
</p><ul>
<li> I/O within OpenMP target regions and OpenACC parallel/kernels is supported
      using the C library <code>printf</code> functions. Note that the Fortran
      <code>print</code>/<code>write</code> statements are not supported, yet.
</li><li> Compilation OpenMP code that contains <code>requires reverse_offload</code>
      requires at least <code>-march=sm_35</code>, compiling for <code>-march=sm_30</code>
      is not supported.
</li><li> For code containing reverse offload (i.e. <code>target</code> regions with
      <code>device(ancestor:1)</code>), there is a slight performance penalty
      for <em>all</em> target regions, consisting mostly of shutdown delay
      Per device, reverse offload regions are processed serially such that
      the next reverse offload region is only executed after the previous
      one returned.
</li><li> OpenMP code that has a requires directive with <code>unified_address</code>
      or <code>unified_shared_memory</code> will remove any nvptx device from the
      list of available devices (&ldquo;host fallback&rdquo;).
</li></ul>



<hr>
<div class="header">
<p>
Previous: <a href="AMD-Radeon.html" accesskey="p" rel="prev">AMD Radeon</a>, Up: <a href="Offload_002dTarget-Specifics.html" accesskey="u" rel="up">Offload-Target Specifics</a> &nbsp; [<a href="index.html#SEC_Contents" title="Table of contents" rel="contents">Contents</a>][<a href="Library-Index.html" title="Index" rel="index">Index</a>]</p>
</div>



</body>
</html>
